RFC: 817

MODULARITY AND EFFICIENCY IN PROTOCOL IMPLEMENTATION

David D. Clark
MIT Laboratory for Computer Science
Computer Systems and Communications Group
July, 1982


1. Introduction

Many protocol implementers have made the unpleasant discovery that their packages do not run quite as fast as they had hoped. The blame for this widely observed problem has been attributed to a variety of causes, ranging from details in the design of the protocol to the underlying structure of the host operating system. This RFC will discuss some of the commonly encountered reasons why protocol implementations seem to run slowly.

Experience suggests that one of the most important factors in determining the performance of an implementation is the manner in which that implementation is modularized and integrated into the host operating system. For this reason, it is useful to discuss the question of how an implementation is structured at the same time that we consider how it will perform. In fact, this RFC will argue that modularity is one of the chief villains in attempting to obtain good performance, so that the designer is faced with a delicate and inevitable tradeoff between good structure and good performance. Further, the single factor which most strongly determines how well this conflict can be resolved is not the protocol but the operating system.

2. Efficiency Considerations

There are many aspects to efficiency. One aspect is sending data at minimum transmission cost, which is a critical aspect of common carrier communications, if not of local area network communications. Another aspect is sending data at a high rate, which may not be possible at all if the net is very slow, but which may be the one central design constraint when taking advantage of a local net with high raw bandwidth. The final consideration is doing the above with minimum expenditure of computer resources. This last may be necessary to achieve high speed, but in the case of the slow net may be important only in that the resources used up, for example cpu cycles, are costly or otherwise needed. It is worth pointing out that these different goals often conflict; for example it is often possible to trade off efficient use of the computer against efficient use of the network. Thus, there may be no such thing as a successful general purpose protocol implementation.

The simplest measure of performance is throughput, measured in bits per second. It is worth doing a few simple computations in order to get a feeling for the magnitude of the problems involved. Assume that data is being sent from one machine to another in packets of 576 bytes, the maximum generally acceptable internet packet size. Allowing for header overhead, this packet size permits 4288 bits in each packet. If a useful throughput of 10,000 bits per second is desired, then a data bearing packet must leave the sending host about every 430 milliseconds, a little over two per second. This is clearly not difficult to achieve. However, if one wishes to achieve 100 kilobits per second throughput, the packet must leave the host every 43 milliseconds, and to achieve one megabit per second, which is not at all unreasonable on a high-speed local net, the packets must be spaced no more than 4.3 milliseconds apart.
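The arithmetic above can be checked directly. The following sketch (in Python, for concreteness) reproduces the per-packet time budget at each throughput; the 40-byte combined header figure is an assumption chosen to be consistent with the 4288-bit payload stated in the text.

```python
# Reproducing the document's arithmetic: a 576-byte packet minus a
# 40-byte header (assumed: combined IP + TCP) leaves 4288 data bits.
PACKET_BYTES = 576
HEADER_BYTES = 40

data_bits = (PACKET_BYTES - HEADER_BYTES) * 8   # 4288

def interval_ms(throughput_bps):
    """Time budget per packet, in milliseconds, at a given throughput."""
    return 1000.0 * data_bits / throughput_bps

for bps in (10_000, 100_000, 1_000_000):
    print(f"{bps:>9} bits/sec -> one packet every {interval_ms(bps):.1f} ms")
```

The printed intervals (428.8, 42.9, and 4.3 ms) round to the figures quoted in the text.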

These latter numbers are a slightly more alarming goal for which to set one's sights. Many operating systems take a substantial fraction of a millisecond just to service an interrupt. If the protocol has been structured as a process, it is necessary to go through a process scheduling before the protocol code can even begin to run. If any piece of a protocol package or its data must be fetched from disk, real time delays of between 30 and 100 milliseconds can be expected. If the protocol must compete for cpu resources with other processes of the system, it may be necessary to wait a scheduling quantum before the protocol can run. Many systems have a scheduling quantum of 100 milliseconds or more. Considering these sorts of numbers, it becomes immediately clear that the protocol must be fitted into the operating system in a thorough and effective manner if anything like reasonable throughput is to be achieved.

There is one obvious conclusion immediately suggested by even this simple analysis. Except in very special circumstances, when many packets are being processed at once, the cost of processing a packet is dominated by factors, such as cpu scheduling, which are independent of the packet size. This suggests two general rules which any implementation ought to obey. First, send data in large packets. Obviously, if processing time per packet is a constant, then throughput will be directly proportional to the packet size. Second, never send an unneeded packet. Unneeded packets use up just as many resources as a packet full of data, but perform no useful function. RFC 813, "Window and Acknowledgement Strategy in TCP", discusses one aspect of reducing the number of packets sent per useful data byte. This document will mention other attacks on the same problem.
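The first rule follows directly from the constant per-packet cost; a small illustrative calculation makes the proportionality concrete. The 1 ms fixed cost below is an assumed figure for illustration only, and transmission time itself is ignored.

```python
# If per-packet overhead (scheduling, interrupt service) is a fixed
# cost, throughput scales linearly with the data carried per packet.
PER_PACKET_MS = 1.0   # assumed fixed processing cost per packet

def throughput_bps(packet_data_bits):
    packets_per_sec = 1000.0 / PER_PACKET_MS
    return packets_per_sec * packet_data_bits

small = throughput_bps(8 * 64)     # 64 data bytes per packet
large = throughput_bps(8 * 536)    # 536 data bytes per packet
print(f"64-byte packets:  {small:,.0f} bits/sec")
print(f"536-byte packets: {large:,.0f} bits/sec")  # 8.375x better
```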

The above analysis suggests that there are two main parts to the problem of achieving good protocol performance. The first has to do with how the protocol implementation is integrated into the host operating system. The second has to do with how the protocol package itself is organized internally. This document will consider each of these topics in turn.

3. The Protocol vs. the Operating System

There are normally three reasonable ways in which to add a protocol to an operating system. The protocol can be in a process that is provided by the operating system, or it can be part of the kernel of the operating system itself, or it can be put in a separate communications processor or front end machine. This decision is strongly influenced by details of hardware architecture and operating system design; each of these three approaches has its own advantages and disadvantages.

The "process" is the abstraction which most operating systems use to provide the execution environment for user programs. A very simple path for implementing a protocol is to obtain a process from the operating system and implement the protocol to run in it. Superficially, this approach has a number of advantages. Since modifications to the kernel are not required, the job can be done by someone who is not an expert in the kernel structure. Since it is often impossible to find somebody who is experienced both in the structure of the operating system and the structure of the protocol, this path, from a management point of view, is often extremely appealing. Unfortunately, putting a protocol in a process has a number of disadvantages, related to both structure and performance. First, as was discussed above, process scheduling can be a significant source of real-time delay. There is not only the actual cost of going through the scheduler, but the problem that the operating system may not have the right sort of priority tools to bring the process into execution quickly whenever there is work to be done.

Structurally, the difficulty with putting a protocol in a process is that the protocol may be providing services, for example support of data streams, which are normally obtained by going to special kernel entry points. Depending on the generality of the operating system, it may be impossible to take a program which is accustomed to reading through a kernel entry point, and redirect it so it is reading the data from a process. The most extreme example of this problem occurs when implementing server telnet. In almost all systems, the device handler for the locally attached teletypes is located inside the kernel, and programs read and write from their teletype by making kernel calls. If server telnet is implemented in a process, it is then necessary to take the data streams provided by server telnet and somehow get them back down inside the kernel so that they mimic the interface provided by local teletypes. It is usually the case that special kernel modification is necessary to achieve this structure, which somewhat defeats the benefit of having removed the protocol from the kernel in the first place.

Clearly, then, there are advantages to putting the protocol package in the kernel. Structurally, it is reasonable to view the network as a device, and device drivers are traditionally contained in the kernel. Presumably, the problems associated with process scheduling can be sidestepped, at least to a certain extent, by placing the code inside the kernel. And it is obviously easier to make the server telnet channels mimic the local teletype channels if they are both realized at the same level in the kernel.

However, implementation of protocols in the kernel has its own set of pitfalls. First, network protocols have a characteristic which is shared by almost no other device: they require rather complex actions to be performed as a result of a timeout. The problem with this requirement is that the kernel often has no facility by which a program can be brought into execution as a result of the timer event. What is really needed, of course, is a special sort of process inside the kernel. Most systems lack this mechanism. Failing that, the only execution mechanism available is to run at interrupt time.

There are substantial drawbacks to implementing a protocol to run at interrupt time. First, the actions performed may be somewhat complex and time consuming, compared to the maximum amount of time that the operating system is prepared to spend servicing an interrupt. Problems can arise if interrupts are masked for too long. This is particularly bad when running as a result of a clock interrupt, which can imply that the clock interrupt is masked. Second, the environment provided by an interrupt handler is usually extremely primitive compared to the environment of a process. There are usually a variety of system facilities which are unavailable while running in an interrupt handler. The most important of these is the ability to suspend execution pending the arrival of some event or message. It is a cardinal rule of almost every known operating system that one must not invoke the scheduler while running in an interrupt handler. Thus, the programmer who is forced to implement all or part of his protocol package as an interrupt handler must be the best sort of expert in the operating system involved, and must be prepared for development sessions filled with obscure bugs which crash not just the protocol package but the entire operating system.

A final problem with processing at interrupt time is that the system scheduler has no control over the percentage of system time used by the protocol handler. If a large number of packets arrive, from a foreign host that is either malfunctioning or fast, all of the time may be spent in the interrupt handler, effectively killing the system.

There are other problems associated with putting protocols into an operating system kernel. The simplest problem often encountered is that the kernel address space is simply too small to hold the piece of code in question. This is a rather artificial sort of problem, but it is a severe problem none the less in many machines. It is an appallingly unpleasant experience to do an implementation with the knowledge that for every byte of new feature put in one must find some other byte of old feature to throw out. It is hopeless to expect an effective and general implementation under this kind of constraint. Another problem is that the protocol package, once it is thoroughly entwined in the operating system, may need to be redone every time the operating system changes. If the protocol and the operating system are not maintained by the same group, this makes maintenance of the protocol package a perpetual headache.

The third option for protocol implementation is to take the protocol package and move it outside the machine entirely, onto a separate processor dedicated to this kind of task. Such a machine is often described as a communications processor or a front-end processor. There are several advantages to this approach. First, the operating system on the communications processor can be tailored for precisely this kind of task. This makes the job of implementation much easier. Second, one does not need to redo the task for every machine to which the protocol is to be added. It may be possible to reuse the same front-end machine on different host computers. Since the task need not be done as many times, one might hope that more attention could be paid to doing it right. Given a careful implementation in an environment which is optimized for this kind of task, the resulting package should turn out to be very efficient. Unfortunately, there are also problems with this approach. There is, of course, a financial problem associated with buying an additional computer. In many cases, this is not a problem at all since the cost is negligible compared to what the programmer would cost to do the job in the mainframe itself. More fundamentally, the communications processor approach does not completely sidestep any of the problems raised above. The reason is that the communications processor, since it is a separate machine, must be attached to the mainframe by some mechanism. Whatever that mechanism, code is required in the mainframe to deal with it. It can be argued that the program to deal with the communications processor is simpler than the program to implement the entire protocol package. Even if that is so, the communications processor interface package is still a protocol in nature, with all of the same structural problems. Thus, all of the issues raised above must still be faced. In addition to those problems, there are some other, more subtle problems associated with an outboard implementation of a protocol. We will return to these problems later.

There is a way of attaching a communications processor to a mainframe host which sidesteps all of the mainframe implementation problems, which is to use some preexisting interface on the host machine as the port by which a communications processor is attached. This strategy is often used as a last stage of desperation when the software on the host computer is so intractable that it cannot be changed in any way. Unfortunately, it is almost inevitably the case that all of the available interfaces are totally unsuitable for this purpose, so the result is unsatisfactory at best. The most common way in which this form of attachment occurs is when a network connection is being used to mimic local teletypes. In this case, the front-end processor can be attached to the mainframe by simply providing a number of wires out of the front-end processor, each corresponding to a connection, which are plugged into teletype ports on the mainframe computer. (Because of the appearance of the physical configuration which results from this arrangement, Michael Padlipsky has described this as the "milking machine" approach to computer networking.) This strategy solves the immediate problem of providing remote access to a host, but it is extremely inflexible. The channels being provided to the host are restricted by the host software to one purpose only, remote login. It is impossible to use them for any other purpose, such as file transfer or sending mail, so the host is integrated into the network environment in an extremely limited and inflexible manner. If this is the best that can be done, then it should be tolerated. Otherwise, implementors should be strongly encouraged to take a more flexible approach.

4. Protocol Layering

The previous discussion suggested that there was a decision to be made as to where a protocol ought to be implemented. In fact, the decision is much more complicated than that, for the goal is not to implement a single protocol, but to implement a whole family of protocol layers, starting with a device driver or local network driver at the bottom, then IP and TCP, and eventually reaching the application specific protocols, such as Telnet, FTP and SMTP, on the top. Clearly, the bottommost of these layers is somewhere within the kernel, since the physical device driver for the net is almost inevitably located there. Equally clearly, the top layers of this package, which provide the user his ability to perform the remote login function or to send mail, are not entirely contained within the kernel. Thus, the question is not whether the protocol family shall be inside or outside the kernel, but how it shall be sliced in two between that part inside and that part outside.

Since protocols come nicely layered, an obvious proposal is that one of the layer interfaces should be the point at which the inside and outside components are sliced apart. Most systems have been implemented in this way, and many have been made to work quite effectively. One obvious place to slice is at the upper interface of TCP. Since TCP provides a bidirectional byte stream, which is somewhat similar to the I/O facility provided by most operating systems, it is possible to make the interface to TCP almost mimic the interface to other existing devices. Except in the matter of opening a connection, and dealing with peculiar failures, the software using TCP need not know that it is a network connection, rather than a local I/O stream, that is providing the communications function. This approach does put TCP inside the kernel, which raises all the problems addressed above. It also raises the problem that the interface to the IP layer can, if the programmer is not careful, become excessively buried inside the kernel. It must be remembered that things other than TCP are expected to run on top of IP. The IP interface must be made accessible, even if TCP sits on top of it inside the kernel.
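The observation that software using TCP need not know it has a network connection can be sketched as a shared byte-stream interface. All class and method names below are invented for illustration, and the streams are backed by an in-memory buffer rather than a real kernel entry point.

```python
# Hypothetical byte-stream classes: a program written against the
# common interface cannot tell a TCP connection from a local file,
# except at open time.
class ByteStream:
    def __init__(self):
        self._buffer = bytearray()

    def write(self, data: bytes) -> int:
        self._buffer.extend(data)
        return len(data)

    def read(self, n: int) -> bytes:
        out = bytes(self._buffer[:n])
        del self._buffer[:n]
        return out

class LocalFile(ByteStream):
    pass

class TcpConnection(ByteStream):
    # Only the act of opening reveals that this is a network stream.
    def __init__(self, remote_host: str, remote_port: int):
        super().__init__()
        self.remote = (remote_host, remote_port)

def copy_all(src: ByteStream, dst: ByteStream, chunk: int = 512):
    """Works identically on files and connections."""
    while True:
        data = src.read(chunk)
        if not data:
            break
        dst.write(data)

src = TcpConnection("example-host", 23)
src.write(b"remote login traffic")
dst = LocalFile()
copy_all(src, dst)
assert dst.read(20) == b"remote login traffic"
```

The peculiar-failure cases the text mentions (connection reset, timeout) are exactly what this tidy picture leaves out.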

Another obvious place to slice is above Telnet. The advantage of slicing above Telnet is that it solves the problem of having remote login channels emulate local teletype channels. The disadvantage of putting Telnet into the kernel is that the amount of code which has now been included there is getting remarkably large. In some early implementations, the size of the network package, when one includes protocols at the level of Telnet, rivals the size of the rest of the supervisor. This leads to vague feelings that all is not right.

Any attempt to slice through a lower layer boundary, for example between internet and TCP, reveals one fundamental problem. The TCP layer, as well as the IP layer, performs a demultiplexing function on incoming datagrams. Until the TCP header has been examined, it is not possible to know for which user the packet is ultimately destined. Therefore, if TCP, as a whole, is moved outside the kernel, it is necessary to create one separate process called the TCP process, which performs the TCP multiplexing function, and probably all of the rest of TCP processing as well. This means that incoming data destined for a user process involves not just a scheduling of the user process, but scheduling the TCP process first.

This suggests an alternative structuring strategy which slices through the protocols, not along an established layer boundary, but along a functional boundary having to do with demultiplexing. In this approach, certain parts of IP and certain parts of TCP are placed in the kernel. The amount of code placed there is sufficient so that when an incoming datagram arrives, it is possible to know for which process that datagram is ultimately destined. The datagram is then routed directly to the final process, where additional IP and TCP processing is performed on it. This removes from the kernel any requirement for timer based actions, since they can be done by the process provided by the user. This structure has the additional advantage of reducing the amount of code required in the kernel, so that it is suitable for systems where kernel space is at a premium. RFC 814, titled "Names, Addresses, Ports, and Routes," discusses this rather orthogonal slicing strategy in more detail.
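The kernel-resident piece of this scheme needs only enough header parsing to map an incoming datagram to its owning process. A minimal sketch, assuming IPv4-style field offsets and an invented connection table:

```python
import struct

# Minimal kernel-side demultiplexer: parse just enough of the IP and
# TCP headers to find the connection, then hand the whole datagram to
# the owning user process for the rest of IP/TCP processing.
connection_table = {}   # (src_addr, src_port, dst_port) -> process name

def demultiplex(datagram: bytes):
    ihl = (datagram[0] & 0x0F) * 4            # IP header length, bytes
    src_addr = datagram[12:16]                # source internet address
    src_port, dst_port = struct.unpack("!HH", datagram[ihl:ihl + 4])
    return connection_table.get((src_addr, src_port, dst_port))

# Example: register a connection, then route a matching datagram.
header = bytes([0x45]) + bytes(11) + bytes([10, 0, 0, 1]) + bytes(4)
segment = struct.pack("!HH", 1023, 23)        # src port 1023, dst 23
connection_table[(bytes([10, 0, 0, 1]), 1023, 23)] = "telnet-process"
assert demultiplex(header + segment) == "telnet-process"
```

Everything else, including retransmission timers, stays in the destination process, which is what removes timer-based actions from the kernel.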

A related discussion of protocol layering and multiplexing can be found in Cohen and Postel [1].

5. Breaking Down the Barriers

In fact, the implementor should be sensitive to the possibility of even more peculiar slicing strategies in dividing up the various protocol layers between the kernel and the one or more user processes. The result of the strategy proposed above was that part of TCP should execute in the process of the user. In other words, instead of having one TCP process for the system, there is one TCP process per connection. Given this architecture, it is no longer necessary to imagine that all of the TCPs are identical. One TCP could be optimized for high throughput applications, such as file transfer. Another TCP could be optimized for small, low delay applications such as Telnet. In fact, it would be possible to produce a TCP which was somewhat integrated with the Telnet or FTP on top of it. Such an integration is extremely important, for it can lead to a kind of efficiency which more traditional structures are incapable of producing. Earlier, this paper pointed out that one of the important rules to achieving efficiency was to send the minimum number of packets for a given amount of data. The idea of protocol layering interacts very strongly (and poorly) with this goal, because independent layers have independent ideas about when packets should be sent, and unless these layers can somehow be brought into cooperation, additional packets will flow. The best example of this is the operation of server telnet in a character at a time remote echo mode on top of TCP. When a packet containing a character arrives at a server host, each layer has a different response to that packet. TCP has an obligation to acknowledge the packet. Either server telnet or the application layer above has an obligation to echo the character received in the packet. If the character is a Telnet control sequence, then Telnet has additional actions which it must perform in response to the packet. The result of this, in most implementations, is that several packets are sent back in response to the one arriving packet. Combining all of these return messages into one packet is important for several reasons. First, of course, it reduces the number of packets being sent over the net, which directly reduces the charges incurred for many common carrier tariff structures. Second, it reduces the number of scheduling actions which will occur inside both hosts, which, as was discussed above, is extremely important in improving throughput.

The way to achieve this goal of packet sharing is to break down the barrier between the layers of the protocols, in a very restrained and careful manner, so that a limited amount of information can leak across the barrier to enable one layer to optimize its behavior with respect to the desires of the layers above and below it. For example, it would represent an improvement if TCP, when it received a packet, could ask the layer above whether or not it would be worth pausing for a few milliseconds before sending an acknowledgement, in order to see if the upper layer would have any outgoing data to send. Dallying before sending the acknowledgement produces precisely the right sort of optimization if the client of TCP is server Telnet. However, dallying before sending an acknowledgement is absolutely unacceptable if TCP is being used for file transfer, for in file transfer there is almost never data flowing in the reverse direction, and the delay in sending the acknowledgement probably translates directly into a delay in obtaining the next packets. Thus, TCP must know a little about the layers above it to adjust its performance as needed.
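The cross-layer hint described above can be sketched as a single upcall from TCP to its client. The interface below is invented for illustration; a real system would express the hint in whatever form its layering permits.

```python
# Sketch of TCP asking the layer above whether to dally before
# acknowledging.  The upcall ("should_dally") is hypothetical.
class TelnetServer:
    # Remote-echo telnet almost always has an echo to send back, so
    # holding the ACK briefly lets it ride in the same packet.
    def should_dally(self) -> bool:
        return True

class FileTransfer:
    # File transfer has no reverse traffic; delaying the ACK only
    # delays the sender's next packets.
    def should_dally(self) -> bool:
        return False

def ack_delay_ms(client) -> float:
    """Per-connection TCP acknowledgement policy, set by the client."""
    return 5.0 if client.should_dally() else 0.0

assert ack_delay_ms(TelnetServer()) == 5.0
assert ack_delay_ms(FileTransfer()) == 0.0
```

With one TCP per connection, as proposed above, the same effect can be obtained with no upcall at all: each TCP is simply built with the right policy.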

It would be possible to imagine a general purpose TCP which was equipped with all sorts of special mechanisms by which it would query the layer above and modify its behavior accordingly. In the structures suggested above, in which there is not one but several TCPs, the TCP can simply be modified so that it produces the correct behavior as a matter of course. This structure has the disadvantage that there will be several implementations of TCP existing on a single machine, which can mean more maintenance headaches if a problem is found where TCP needs to be changed. However, it is probably the case that each of the TCPs will be substantially simpler than the general purpose TCP which would otherwise have been built. There are some experimental projects currently under way which suggest that this approach may make designing a TCP, or almost any other layer, substantially easier, so that the total effort involved in bringing up a complete package is actually less if this approach is followed. This approach is by no means generally accepted, but deserves some consideration.

The general conclusion to be drawn from this sort of consideration is that a layer boundary has both a benefit and a penalty. A visible layer boundary, with a well specified interface, provides a form of isolation between two layers which allows one to be changed with the confidence that the other one will not stop working as a result. However, a firm layer boundary almost inevitably leads to inefficient operation. This can easily be seen by analogy with other aspects of operating systems. Consider, for example, file systems. A typical operating system provides a file system, which is a highly abstracted representation of a disk. The interface is highly formalized, and presumed to be highly stable. This makes it very easy for naive users to have access to disks without having to write a great deal of software. The existence of a file system is clearly beneficial. On the other hand, it is clear that the restricted interface to a file system almost inevitably leads to inefficiency. If the interface is organized as a sequential read and write of bytes, then there will be people who wish to do high throughput transfers who cannot achieve their goal. If the interface is a virtual memory interface, then other users will regret the necessity of building a byte stream interface on top of the memory mapped file. The most objectionable inefficiency results when a highly sophisticated package, such as a data base management package, must be built on top of an existing operating system. Almost inevitably, the implementors of the data base system attempt to reject the file system and obtain direct access to the disks. They have sacrificed modularity for efficiency.

The same conflict appears in networking, in a rather extreme form. The concept of a protocol is still unknown and frightening to most naive programmers. The idea that they might have to implement a protocol, or even part of a protocol, as part of some application package, is a dreadful thought. And thus there is great pressure to hide the function of the net behind a very hard barrier. On the other hand, the kind of inefficiency which results from this is a particularly undesirable sort of inefficiency, for it shows up, among other things, in increasing the cost of the communications resource used up to achieve the application goal. In cases where one must pay for one's communications costs, they usually turn out to be the dominant cost within the system. Thus, doing an excessively good job of packaging up the protocols in an inflexible manner has a direct impact on increasing the cost of the critical resource within the system. This is a dilemma which will probably only be solved when programmers become somewhat less alarmed about protocols, so that they are willing to weave a certain amount of protocol structure into their application program, much as application programs today weave parts of data base management systems into the structure of their application program.

An extreme example of putting the protocol package behind a firm layer boundary occurs when the protocol package is relegated to a front-end processor. In this case the interface to the protocol is some other protocol. It is difficult to imagine how to build close cooperation between layers when they are that far separated. Realistically, one of the prices which must be associated with an implementation so physically modularized is that the performance will suffer as a result. Of course, a separate processor for protocols could be very closely integrated into the mainframe architecture, with interprocessor co-ordination signals, shared memory, and similar features. Such a physical modularity might work very well, but there is little documented experience with this closely coupled architecture for protocol support.

6. Efficiency of Protocol Processing

To this point, this document has considered how a protocol package should be broken into modules, and how those modules should be distributed between free standing machines, the operating system kernel, and one or more user processes. It is now time to consider the other half of the efficiency question, which is what can be done to speed the execution of those programs that actually implement the protocols. We will make some specific observations about TCP and IP, and then conclude with a few generalities.

IP is a simple protocol, especially with respect to the processing of normal packets, so it should be easy to get it to perform efficiently. The only area of any complexity related to actual packet processing has to do with fragmentation and reassembly. The reader is referred to RFC 815, titled "IP Datagram Reassembly Algorithms", for specific consideration of this point.
-
- Most costs in the IP layer come from table look up functions, as
-
- opposed to packet processing functions. An outgoing packet requires two
-
- translation functions to be performed. The internet address must be
-
- translated to a target gateway, and a gateway address must be translated
-
- to a local network number (if the host is attached to more than one
-
- network). It is easy to build a simple implementation of these table
-
- look up functions that in fact performs very poorly. The programmer
-
- should keep in mind that there may be as many as a thousand network
-
- numbers in a typical configuration. Linear searching of a thousand
-
- entry table on every packet is extremely unsuitable. In fact, it may be
-
- worth asking TCP to cache a hint for each connection, which can be
-
- handed down to IP each time a packet is sent, to try to avoid the
-
- overhead of a table look up.
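The hint-caching strategy described above can be sketched in C. The structure names, hash size, and function names below are illustrative, not taken from any particular implementation: IP keeps its routing table hashed by network number rather than searched linearly, and TCP hands back the route it used last as a hint, which IP validates with a single comparison before falling back to the table.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical route cache; names and sizes are illustrative only. */
#define RT_HASH_SIZE 256

struct route { uint32_t net; uint32_t gateway; struct route *next; };
static struct route *rt_hash[RT_HASH_SIZE];

static unsigned rt_index(uint32_t net) { return net % RT_HASH_SIZE; }

/* Hashed lookup: O(1) on average, instead of a linear scan of a
   thousand-entry table on every outgoing packet. */
struct route *rt_lookup(uint32_t net) {
    struct route *r;
    for (r = rt_hash[rt_index(net)]; r != NULL; r = r->next)
        if (r->net == net)
            return r;
    return NULL;
}

void rt_add(struct route *r) {
    unsigned i = rt_index(r->net);
    r->next = rt_hash[i];
    rt_hash[i] = r;
}

/* TCP caches the route it used last and passes it down with each
   packet; IP checks the hint before consulting the table at all. */
struct route *ip_route(uint32_t net, struct route *hint) {
    if (hint != NULL && hint->net == net)
        return hint;              /* common case: one comparison */
    return rt_lookup(net);        /* slow path: hashed table lookup */
}
```

The hint check costs one comparison on the common path, so the table is consulted only when a connection's route has not yet been established or has changed.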
-
-
- TCP is a more complex protocol, and presents many more
-
- opportunities for getting things wrong. There is one area which is
-
- generally accepted as causing noticeable and substantial overhead as
-
- part of TCP processing. This is computation of the checksum. It would
-
- be nice if this cost could be avoided somehow, but the idea of an end-
-
- to-end checksum is absolutely central to the functioning of TCP. No
-
- host implementor should think of omitting the validation of a checksum
-
- on incoming data.
-
-
- Various clever tricks have been used to try to minimize the cost of
-
- computing the checksum. If it is possible to add additional microcoded
-
- instructions to the machine, a checksum instruction is the most obvious
-
- candidate. Since computing the checksum involves picking up every byte
-
- of the segment and examining it, it is possible to combine the operation
-
- of computing the checksum with the operation of copying the segment from
-
- one location to another. Since a number of data copies are probably
-
- already required as part of the processing structure, this kind of
-
- sharing might conceivably pay off if it didn't cause too much trouble to
-
- the modularity of the program. Finally, computation of the checksum
-
- seems to be one place where careful attention to the details of the
-
- algorithm used can make a drastic difference in the throughput of the
-
- program. The Multics system provides one of the best case studies of
-
- this, since Multics is about as poorly organized to perform this
-
- function as any machine implementing TCP. Multics is a 36-bit word
-
- machine, with four 9-bit bytes per word. The eight-bit bytes of a TCP
-
- segment are laid down packed in memory, ignoring word boundaries. This
-
- means that when it is necessary to pick up the data as a set of 16-bit
-
- units for the purpose of adding them to compute checksums, horrible
-
- masking and shifting is required for each 16-bit value. An early
-
- version of a program using this strategy required 6 milliseconds to
-
- checksum a 576-byte segment. Obviously, at this point, checksum
-
- computation was becoming the central bottleneck to throughput. A more
-
- careful recoding of this algorithm reduced the checksum processing time
-
- to less than one millisecond. The strategy used was extremely dirty.
-
- It involved adding up carefully selected words of the area in which the
-
- data lay, knowing that for those particular words, the 16-bit values
-
- were properly aligned inside the words. Only after the addition had
-
- been done were the various sums shifted, and finally added to produce
-
- the eventual checksum. This kind of highly specialized programming is
-
- probably not acceptable if used everywhere within an operating system.
-
- It is clearly appropriate for one highly localized function which can be
-
- clearly identified as an extreme performance bottleneck.
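The combined copy-and-checksum idea can be sketched as follows, using the Internet-style 16-bit ones' complement sum. This is a minimal illustration rather than production code: it assumes the segment is an even number of bytes and already aligned on a 16-bit boundary, which is precisely the alignment the Multics recoding described above had to fight to obtain.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of a combined copy-and-checksum: the 16-bit ones' complement
   sum is folded into the same loop that moves the data, so each byte
   of the segment is picked up only once instead of twice. */
uint16_t copy_and_checksum(uint16_t *dst, const uint16_t *src,
                           size_t nwords) {
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < nwords; i++) {
        dst[i] = src[i];          /* the copy that was needed anyway */
        sum += src[i];            /* accumulate with carries in bit 16+ */
    }
    while (sum >> 16)             /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;        /* ones' complement of the sum */
}
```

Since one of the data copies is already required by the processing structure, the checksum rides along at the cost of one addition per word, rather than a second full pass over the segment.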
-
-
- Another area of TCP processing which may cause performance problems
-
- is the overhead of examining all of the possible flags and options which
-
- occur in each incoming packet. One paper, by Bunch and Day [2], asserts
-
- that the overhead of packet header processing is actually an important
-
- limiting factor in throughput computation. Not all measurement
-
- experiments have tended to support this result. To whatever extent it
-
- is true, however, there is an obvious strategy which the implementor
-
- ought to use in designing his program. He should build his program to
-
- optimize the expected case. It is easy, especially when first designing
-
- a program, to pay equal attention to all of the possible outcomes of
-
- every test. In practice, however, few of these will ever happen. A TCP
-
- should be built on the assumption that the next packet to arrive will
-
- have absolutely nothing special about it, and will be the next one
-
- expected in the sequence space. One or two tests are sufficient to
-
- determine that the expected set of control flags are on. (The ACK flag
-
- should be on; the Push flag may or may not be on. No other flags should
-
- be on.) One test is sufficient to determine that the sequence number of
-
- the incoming packet is one greater than the last sequence number
-
- received. In almost every case, that will be the actual result. Again,
-
- using the Multics system as an example, failure to optimize the case of
-
- receiving the expected sequence number had a detectable effect on the
-
- performance of the system. The particular problem arose when a number
-
- of packets arrived at once. TCP attempted to process all of these
-
- packets before awaking the user. As a result, by the time the last
-
- packet arrived, there was a threaded list of packets which had several
-
- items on it. When a new packet arrived, the list was searched to find
-
- the location into which the packet should be inserted. Obviously, the
-
- list should be searched from highest sequence number to lowest sequence
-
- number, because one is expecting to receive a packet which comes after
-
- those already received. By mistake, the list was searched from front to
-
- back, starting with the packets with the lowest sequence number. The
-
- amount of time spent searching this list backwards was easily detectable
-
- in the metering measurements.
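The expected-case tests described above amount to something like the following sketch. The flag constants follow the TCP header bit assignments; the function name and calling convention are hypothetical. The point is that one masked comparison disposes of all the control flags at once (ACK must be on, Push may or may not be, nothing else may be), and one equality test confirms that the segment is the next one expected in the sequence space.

```c
#include <stdint.h>

/* TCP control flag bits, as laid out in the TCP header. */
#define TH_FIN  0x01
#define TH_SYN  0x02
#define TH_RST  0x04
#define TH_PUSH 0x08
#define TH_ACK  0x10
#define TH_URG  0x20

/* Illustrative expected-case test: is this segment the in-order one
   with nothing special about it?  Masking out Push leaves exactly
   TH_ACK for a normal segment, so one comparison checks all six
   flags; one more comparison checks the sequence number. */
int tcp_expected_case(uint8_t flags, uint32_t seq, uint32_t rcv_nxt) {
    return (flags & ~TH_PUSH) == TH_ACK && seq == rcv_nxt;
}
```

A TCP built this way falls through to the full, careful analysis only for the rare packet that fails these two tests.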
-
-
- Other data structures can be organized to optimize the action which
-
- is normally taken on them. For example, the retransmission queue is
-
- very seldom actually used for retransmission, so it should not be
-
- organized to optimize that action. In fact, it should be organized to
-
- optimize the discarding of things from it when the acknowledgement
-
- arrives. In many cases, the easiest way to do this is not to save the
-
- packet at all, but to reconstruct it only if it needs to be
-
- retransmitted, starting from the data as it was originally buffered by
-
- the user.
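As a sketch of a retransmission queue organized for discard rather than retransmission, consider the following. The names are illustrative. Each entry records only where its data lies in sequence space, pointing back at the user's buffered data rather than holding a copy of the packet; the list is kept in sequence order, so an arriving acknowledgement frees entries with a cheap walk from the head, while the rare retransmission rebuilds the packet from the user's buffer.

```c
#include <stdint.h>
#include <stdlib.h>

/* One queue entry per unacknowledged segment: just a descriptor of
   the sequence range, not a saved copy of the packet itself. */
struct txq_ent {
    uint32_t seq;            /* first sequence number of the segment */
    uint32_t len;            /* length in sequence space */
    struct txq_ent *next;    /* next entry, in increasing seq order */
};

/* Process an acknowledgement: discard every entry wholly covered by
   the ack, starting from the head, and return the new head.  This is
   the common operation, so it is the one the structure optimizes. */
struct txq_ent *txq_ack(struct txq_ent *head, uint32_t ack) {
    while (head != NULL && head->seq + head->len <= ack) {
        struct txq_ent *done = head;
        head = head->next;
        free(done);
    }
    return head;
}
```

Because nothing but the descriptor is stored, discarding on acknowledgement is a few pointer operations per segment, and no effort is wasted keeping retransmittable packet images that will almost never be used.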
-
-
- There is another generality, at least as important as optimizing
-
- the common case, which is to avoid copying data any more times than
-
- necessary. One more result from the Multics TCP may prove enlightening
-
- here. Multics takes between two and three milliseconds within the TCP
-
- layer to process an incoming packet, depending on its size. For a 576-
-
- byte packet, the three milliseconds is used up approximately as follows.
-
- One millisecond is used computing the checksum. Six hundred
-
- microseconds is spent copying the data. (The data is copied twice, at
-
- .3 milliseconds a copy.) One of those copy operations could correctly
-
- be included as part of the checksum cost, since it is done to get the
-
- data on a known word boundary to optimize the checksum algorithm.
-
- However, the copy also performs another necessary transfer at the same
-
- time. Header processing and packet resequencing takes .7 milliseconds.
-
- The rest of the time is used in miscellaneous processing, such as
-
- removing packets from the retransmission queue which are acknowledged by
-
- this packet. Data copying is the second most expensive single operation
-
- after data checksumming. Some implementations, often because of an
-
- excessively layered modularity, end up copying the data around a great
-
- deal. Other implementations end up copying the data because there is no
-
- shared memory between processes, and the data must be moved from process
-
- to process via a kernel operation. Unless the amount of this activity
-
- is kept strictly under control, it will quickly become the major
-
- performance bottleneck.
-
-
- 7. Conclusions
-
-
- This document has addressed two aspects of obtaining performance
-
- from a protocol implementation, the way in which the protocol is layered
-
- and integrated into the operating system, and the way in which the
-
- detailed handling of the packet is optimized. It would be nice if one
-
- or the other of these costs would completely dominate, so that all of
-
- one's attention could be concentrated there. Regrettably, this is not
-
- so. Depending on the particular sort of traffic one is getting, for
-
- example, whether Telnet one-byte packets or file transfer maximum size
-
- packets at maximum speed, one can expect to see one or the other cost
-
- being the major bottleneck to throughput. Most implementors who have
-
- studied their programs in an attempt to find out where the time was
-
- going have reached the unsatisfactory conclusion that it is going
-
- equally to all parts of their program. With the possible exception of
-
- checksum processing, very few people have ever found that their
-
- performance problems were due to a single, horrible bottleneck which
-
- they could fix by a single stroke of inventive programming. Rather, the
-
- performance was something which was improved by painstaking tuning of
-
- the entire program.
-
-
- Most discussions of protocols begin by introducing the concept of
-
- layering, which tends to suggest that layering is a fundamentally
-
- wonderful idea which should be a part of every consideration of
-
- protocols. In fact, layering is a mixed blessing. Clearly, a layer
-
- interface is necessary whenever more than one client of a particular
-
- layer is to be allowed to use that same layer. But an interface,
-
- precisely because it is fixed, inevitably leads to a lack of complete
-
- understanding as to what one layer wishes to obtain from another. This
-
- has to lead to inefficiency. Furthermore, layering is a potential snare
-
- in that one is tempted to think that a layer boundary, which was an
-
- artifact of the specification procedure, is in fact the proper boundary
-
- to use in modularizing the implementation. Again, in certain cases, an
-
- architected layer must correspond to an implemented layer, precisely so
-
- that several clients can have access to that layer in a reasonably
-
- straightforward manner. In other cases, cunning rearrangement of the
-
- implemented module boundaries to match with various functions, such as
-
- the demultiplexing of incoming packets, or the sending of asynchronous
-
- outgoing packets, can lead to unexpected performance improvements
-
- compared to more traditional implementation strategies. Finally, good
-
- performance is something which is difficult to retrofit onto an existing
-
- program. Since performance is influenced, not just by the fine detail,
-
- but by the gross structure, it is sometimes the case that in order to
-
- obtain a substantial performance improvement, it is necessary to
-
- completely redo the program from the bottom up. This is a great
-
- disappointment to programmers, especially those doing a protocol
-
- implementation for the first time. Programmers who are somewhat
-
- inexperienced and unfamiliar with protocols are sufficiently concerned
-
- with getting their program logically correct that they do not have the
-
- capacity to think at the same time about the performance of the
-
- structure they are building. Only after they have achieved a logically
-
- correct program do they discover that they have done so in a way which
-
- has precluded real performance. Clearly, it is more difficult to design
-
- a program thinking from the start about both logical correctness and
-
- performance. With time, as implementors as a group learn more about the
-
- appropriate structures to use for building protocols, it will be
-
- possible to proceed with an implementation project having more
-
- confidence that the structure is rational, that the program will work,
-
- and that the program will work well. Those of us now implementing
-
- protocols have the privilege of being on the forefront of this learning
-
- process. It should be no surprise that our programs sometimes suffer
-
- from the uncertainty we bring to bear on them.
-
-
- Citations
-
-
- [1] Cohen and Postel, "On Protocol Multiplexing", Sixth Data
-
- Communications Symposium, ACM/IEEE, November 1979.
-
-
- [2] Bunch and Day, "Control Structure Overhead in TCP", Trends and
-
- Applications: Computer Networking, NBS Symposium, May 1980.
-
-
-